Add short term storage expiration indicator to history items #20332
Conversation
I'm anxious about this idea for a few reasons, but Anton is the boss 🤷♀️. If you descend into collections in the history panel, do you get the icon on individual datasets there? The query that summarizes states across the whole collection could accumulate the object store IDs at the same time; it would be a wild query, but it would probably be the easiest and most correct way to summarize the dataset collection. I guess we couldn't get a countdown in that case, but we could add a temp storage icon with more information, like a per-dataset breakdown, shown by clicking on it.
I understand and share your concerns, especially with collections. I also think the other proposed solutions, like sending emails, are even more concerning. It would be really hard to do right without turning it into a massive spam generator, so it's likely not worth it 😅
I would say no. My idea was to do something less accurate, but enough to "inform" the user that the datasets or collections used will be temporary. I thought displaying something at the top level would be enough; if you drill down into that collection, you must have already seen the indication, and we could still display it at the top. I know it is technically possible to mix elements from different object stores in the same collection, but will this be a common case? I was hoping we could assume a single common object store for the HDCA by peeking into just one of its datasets. But yeah, in the worst case, we could do what you suggest: aggregate the object store IDs in the summarize query and, if there is at least one object store ID known to be short-term, display a warning at the top. This would probably already be a huge improvement in raising awareness of the temporary nature of the selected storage without needing many more features.
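(For illustration, the aggregation both comments describe might look roughly like this. This is a sketch only, assuming Galaxy's SQLAlchemy models and a flat collection of HDAs; nested collections would need recursion, and the join path is simplified.)

```python
# Sketch of the aggregation idea: collect the distinct object store IDs of all
# datasets in a collection with a single query. Model names follow Galaxy's
# galaxy.model module; this assumes a flat collection of HDAs.
from sqlalchemy import select

from galaxy.model import Dataset, DatasetCollectionElement, HistoryDatasetAssociation


def collection_object_store_ids(session, dataset_collection_id: int) -> set:
    stmt = (
        select(Dataset.object_store_id)
        .distinct()
        .join(HistoryDatasetAssociation, HistoryDatasetAssociation.dataset_id == Dataset.id)
        .join(DatasetCollectionElement, DatasetCollectionElement.hda_id == HistoryDatasetAssociation.id)
        .where(DatasetCollectionElement.dataset_collection_id == dataset_collection_id)
    )
    return set(session.scalars(stmt))
```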
And we're certain we cannot just take scratch away from people who complain? We "promote" them to a "higher tier" of user where all their data is in permanent storage and advanced options are disabled. Not going to fly, huh?
It is probably uncommon, but collections mixing object stores are pretty easy to create, and my guess is that they would be more common, and have more obvious use cases, than say mixing dbkeys or file extensions; we deal with a mix of those in the UI in a mostly "correct" fashion.
The other option is to not show the indicator at all for collections, and only show it when the user drills down to the dataset level.
I made an attempt to include the set of `object_store_ids`. Let me know if this is still a bad idea 😅
(force-pushed from c5b788c to 86935aa, then from 86935aa to a0fd976)
I've run benchmarks on three dataset collections: 1K, 5K, and 10K datasets. For each collection, I issued 100 requests to the endpoint and recorded the minimum, maximum, and average response times (in milliseconds). Without the `object_store_ids` field:
| Collection | Min (ms) | Max (ms) | Avg (ms) |
| --- | --- | --- | --- |
| 1K | 25.85 | 81.60 | 44.77 |
| 5K | 59.69 | 110.31 | 74.90 |
| 10K | 107.08 | 182.18 | 125.92 |
With the `object_store_ids` field (the changes proposed in d105def):
| Collection | Min (ms) | Max (ms) | Avg (ms) |
| --- | --- | --- | --- |
| 1K | 26.84 | 56.01 | 38.92 |
| 5K | 66.64 | 128.78 | 80.28 |
| 10K | 119.39 | 171.10 | 137.09 |
There is a slight increase in response time for larger collections, but maybe it's worth the tradeoff?
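(For reference, a benchmark along the lines described above can be scripted like this. The endpoint URL is a placeholder, since the exact route isn't shown here.)

```python
# Rough sketch of the benchmark methodology described above: issue 100
# requests against the summary endpoint and report min/max/avg latency.
# The URL is a placeholder; substitute the actual endpoint under test.
import statistics
import time

import requests

URL = "http://localhost:8080/api/..."  # placeholder endpoint
HEADERS = {"x-api-key": "YOUR_GALAXY_API_KEY"}

times_ms = []
for _ in range(100):
    start = time.perf_counter()
    response = requests.get(URL, headers=HEADERS)
    response.raise_for_status()
    times_ms.append((time.perf_counter() - start) * 1000)

print(f"Min: {min(times_ms):.2f} ms")
print(f"Max: {max(times_ms):.2f} ms")
print(f"Avg: {statistics.mean(times_ms):.2f} ms")
```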
On the other hand, I noticed there is still an inaccuracy in this approach. It tracks all the `object_store_ids`, but it assumes the create_time of the whole HDCA is the reference point for calculating expiration, when it should instead consider the oldest create_time among the datasets in each of those object stores.
I will try to explore and benchmark adding the oldest create_time to each `object_store_id` and see what we get...
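(That refinement could extend the same kind of query with a GROUP BY. Again, just a sketch under the same assumptions as the earlier one.)

```python
# Sketch: for each object store used by a collection, find the oldest
# create_time among its datasets, so expiration can be computed per store
# rather than from the HDCA's own create_time. Same simplified join path
# as the earlier sketch, assuming a flat collection of HDAs.
from sqlalchemy import func, select

from galaxy.model import Dataset, DatasetCollectionElement, HistoryDatasetAssociation


def oldest_create_time_per_store(session, dataset_collection_id: int) -> dict:
    stmt = (
        select(Dataset.object_store_id, func.min(Dataset.create_time))
        .join(HistoryDatasetAssociation, HistoryDatasetAssociation.dataset_id == Dataset.id)
        .join(DatasetCollectionElement, DatasetCollectionElement.hda_id == HistoryDatasetAssociation.id)
        .where(DatasetCollectionElement.dataset_collection_id == dataset_collection_id)
        .group_by(Dataset.object_store_id)
    )
    return {store_id: oldest for store_id, oldest in session.execute(stmt)}
```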
(force-pushed from 2213bb5 to 1b8acc2)
The new approach for collections in 1b8acc2 is more accurate, as it takes into account the oldest `create_time` of the datasets associated with each object store used in the collection. Of course, it is slightly slower too, but again, it may be worth the extra time.
[Chart: Average Response Time Comparison (ms)]
This is an optional property that indicates the number of days an object (file) will be stored in short-term storage.
To display the expiration status of datasets stored in a short-term object store.
Reusing the same query used for dbKeys and extensions, we get the unique set of `object_store_ids` where the elements of the collection are stored.
In the case of multiple object stores, we pick the one with the shortest expiration time, as we can assume that as soon as the first element expires, the entire collection should be considered "expired", since we can no longer access all of its elements.
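(As a sketch, that "shortest expiration wins" rule could look like this; helper names are illustrative, not the PR's actual code.)

```python
# Sketch of the rule above: given the oldest create_time per object store and
# each store's object_expires_after_days (None meaning permanent storage),
# the collection is considered expired as soon as its first element expires.
from datetime import datetime, timedelta
from typing import Dict, Optional


def collection_expiration_time(
    oldest_create_time_by_store: Dict[str, datetime],
    expires_after_days_by_store: Dict[str, Optional[int]],
) -> Optional[datetime]:
    candidates = [
        create_time + timedelta(days=days)
        for store_id, create_time in oldest_create_time_by_store.items()
        if (days := expires_after_days_by_store.get(store_id)) is not None
    ]
    return min(candidates) if candidates else None  # None: never expires
```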
And update tests
I don't remember exactly why this was set to optional, but it seems the default of the database field will always be `datetime.now`, so it makes more sense to make the value required.
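(In schema terms, that change amounts to something like the following; an illustrative sketch with a hypothetical model name, not the PR's actual schema.)

```python
# Illustrative sketch: since the database column defaults to datetime.now,
# create_time can be declared required in the response model rather than
# Optional. The model name here is hypothetical.
from datetime import datetime

from pydantic import BaseModel


class DatasetSummary(BaseModel):  # hypothetical model name
    create_time: datetime  # previously Optional[datetime] = None
```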
…ired test: to handle mock datasets when serializing the collection during export.
This provides an accurate expiration date for collections whose datasets have mixed object stores and creation dates.
(force-pushed from 076d536 to 41eaaf6)
xref #20169
This simple approach should not be too expensive and can help the user identify when a dataset might be gone because it is stored in a short-term object store.
This works just by annotating the object store config with a new property, `object_expires_after_days`.

There are still some drawbacks to consider/resolve:

- The value of `object_expires_after_days` is not tied to the actual expiration time of the object store. It seems the cleanup of the object store is handled by external processes, so this value must be kept in sync with the actual expiration time of the object store.
- Collections do not have an `object_store_id` property. I wonder if we could "estimate" or "assume" the object store ID of a collection by looking at the object store ID of its first dataset. This is not ideal, but maybe it could be a good enough workaround? I'm not sure how often collection elements are stored in mixed object stores, but I guess it could happen.
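(For a single dataset, the indicator described above boils down to simple date arithmetic. A minimal sketch, assuming the serializer can see the dataset's create_time and its object store's `object_expires_after_days`.)

```python
# Minimal sketch, not the PR's actual code: a dataset in a short-term object
# store expires object_expires_after_days after its creation time.
from datetime import datetime, timedelta
from typing import Optional


def dataset_expiration_time(
    create_time: datetime,
    object_expires_after_days: Optional[int],
) -> Optional[datetime]:
    if object_expires_after_days is None:
        return None  # permanent storage: no indicator is shown
    return create_time + timedelta(days=object_expires_after_days)
```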